This project clusters Yelp Dataset restaurants from 10 metropolitan cities of North America into contiguous groups of geospatial locations. The interests of the customers who review restaurants in a particular cluster are then used to indicate the supply and demand proportions of various categories of restaurants.
Summary
This project aims to detect supply/demand patterns of restaurants across various categories of food based on customer reviews of the restaurants. The aim is to predict the category of a restaurant (classification) based solely on its location and the interests of its customers. However, it is not a straightforward matter of merging the customer, review and business data, trying various classifiers and predicting the categories. The customer review data must be conditioned to represent the customers' interest in particular categories of food in the individual localities of restaurant businesses to achieve classification, which can then be extrapolated to a supply vs. demand paradigm. Here are the steps that are taken:
- Pick top 20 categories of food that restaurants offer as focus for analysis
- Use unsupervised machine learning algorithms (DBSCAN) to geospatially cluster restaurants
- For each customer, aggregate their reviews for restaurants of each category, treating it as their interest in, or demand for, that type of restaurant.
- For each cluster, group restaurants by category to calculate aggregate supply of restaurants of each category
- For each cluster, aggregate demand by summing up the restaurant demands calculated in step 3, restricted to customers of that cluster of restaurants only
- Merge aggregated data for each cluster to individual restaurants
- Run Classification algorithms to find accuracy of category prediction for each category
- Add new unlabelled restaurants to each cluster by supply/demand ratio, predict their categories and indicate the ratio on the map
- Validate clusters in a known area to see supply/demand of individual categories
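The geospatial core of these steps (clustering and per-cluster supply aggregation) can be sketched on toy data; all coordinates, category flags and parameter values below are invented for illustration only:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

# Toy restaurants: two tight groups of points plus one outlier (lat/lon in degrees, invented)
df = pd.DataFrame({
    'latitude':   [43.6510, 43.6512, 43.6511, 43.6805, 43.6807, 43.6806, 44.0000],
    'longitude':  [-79.3470, -79.3472, -79.3471, -79.4100, -79.4102, -79.4101, -80.0000],
    'Pizza':      [1, 0, 1, 0, 1, 0, 1],
    'Sushi Bars': [0, 1, 0, 1, 0, 1, 0],
})

kms_per_radian = 6371.0088
eps = 0.1 / kms_per_radian          # 100 m expressed in radians
db = DBSCAN(eps=eps, min_samples=3, algorithm='ball_tree', metric='haversine') \
    .fit(np.radians(df[['latitude', 'longitude']].values))
df['label'] = db.labels_            # -1 marks noise

# Per-cluster supply: count of restaurants offering each category
supply = df[df['label'] >= 0].groupby('label')[['Pizza', 'Sushi Bars']].sum()
print(supply)
```

The two tight groups become clusters 0 and 1; the lone point stays noise (label -1), and the groupby yields each cluster's restaurant count per category.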
### Index
Please click links to jump to the specific area:
Import Libraries
Attribute Analysis - For Clustering
Attribute Selection - For Clustering
Data Clustering
User Data Preparation
Mapping Data Preparation
Classification Model Evaluation
Data Visualization
Please click links below to access interactive diagrams and maps:
DBSCAN Min. Neighbors & Distance vs Coverage
DBSCAN Min. Neighbors & Distance vs Cluster Count
DBSCAN Min. Neighbors & Distance vs Largest Cluster Size
DBSCAN Label Count Histogram for Min. Neighbors & Distance
North America Clustered Restaurants by Location (All Categories)
All Clustered Restaurants on Sketch (Toronto)
All Clustered Restaurants on Map (Toronto)
Slider Controlled Categories Displaying Demand (Toronto)
Import required libraries
For some of the third party libraries, you may have to run 'pip install' commands, e.g.
- pip install geopy
- pip install shapely
- pip install matplotlib
- pip install plotly
- pip install cufflinks
- pip install swifter
import pandas as pd
import numpy as np
import time
import matplotlib.pyplot as plt
import plotly.plotly as py
import plotly.graph_objs as go
from joblib import Parallel, delayed
from ipywidgets import FloatProgress
import multiprocessing
from IPython.core.display import display, HTML
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from plotly.graph_objs import *
from plotly import tools
import cufflinks as cf
from collections import Counter
from geopy.distance import great_circle
from shapely.geometry import MultiPoint
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets.samples_generator import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
%matplotlib inline
init_notebook_mode(connected=True)
def __progressbar(ticks):
    __bar = FloatProgress(min=0, max=ticks)  # use the ticks argument, not an undefined name
    display(__bar)
    return __bar
Read data from the files
The following files from Yelp Dataset will be used:
- yelp_academic_dataset_business.json : Business location coordinates, addresses, categories etc.
- yelp_academic_dataset_review.json : User reviews, related business ids etc.
The rest of the data in the dataset is not useful for the purposes of this project.
business_file = "yelp_academic_dataset_business.json"
review_file = "yelp_academic_dataset_review.json"
start_time = time.time()
df_business_data_full = pd.read_json(business_file, lines=True)
df_review_data_full = pd.read_json(review_file, lines=True)
print('Time taken: {:,.2f} seconds'.format(time.time()-start_time))
Reveal datatypes and sample data rows from the dataset
1. Business data datatypes and attributes
df_business_data_full.info()
df_business_data_full.head(3)
2. Review data types and attributes
df_review_data_full.info()
df_review_data_full.head(3)
Reduce the attribute list to only the useful information for this project.
- From business data, [ business_id, latitude, longitude, city, state, postal_code and categories ]
- From review data, [ business_id, review_id, user_id ]
start_time = time.time()
business_cols = ['business_id', 'latitude', 'longitude', 'is_open', 'city', 'neighborhood', \
'state', 'postal_code', 'stars', 'categories']
review_cols = ['business_id', 'review_id', 'user_id', 'cool', 'funny', 'stars', 'useful']
df_business_data = df_business_data_full.filter(business_cols , axis=1)
df_review_data = df_review_data_full.filter(review_cols , axis=1)
print('Time taken: {:,.2f} seconds'.format(time.time()-start_time))
df_business_data.to_pickle('df_business_data.pkl')
df_review_data.to_pickle('df_review_data.pkl')
display(df_business_data.head(3))
display(df_review_data.head(3))
For the scope of this project, filter down to the businesses located within the US/Canada that have been categorized. This eliminates noise and the small number of restaurants that have not been categorized. Limiting the data to the US/Canada helps fit it within North American map coordinates while retaining the majority of the data.
Note: Business categories could be inferred from user reviews; however, that is outside the scope of this project
start_time = time.time()
print(df_business_data.shape)
north_american_state_provinces = ['AK', 'AL', 'AR', 'AS', 'AZ', 'CA', 'CO', 'CT', 'DC', \
'DE', 'FL', 'GA', 'GU', 'HI', 'IA', 'ID', 'IL', 'IN', \
'KS', 'KY', 'LA', 'MA', 'MD', 'ME', 'MI', 'MN', 'MO', \
'MP', 'MS', 'MT', 'NA', 'NC', 'ND', 'NE', 'NH', 'NJ', \
'NM', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'PR', 'RI', \
'SC', 'SD', 'TN', 'TX', 'UT', 'VA', 'VI', 'VT', 'WA', \
'WI', 'WV', 'WY','AB', 'BC', 'MB', 'NB', 'NL', 'NT', \
'NS', 'NU', 'ON', 'PE', 'QC', 'SK', 'YT']
# Keep only businesses in US/Canada states and provinces (vectorized; avoids dropping rows while iterating)
df_business_data = df_business_data[df_business_data['state'].isin(north_american_state_provinces)]
df_business_data = df_business_data[df_business_data['categories'].notnull()]
df_business_data.to_pickle('df_business_data.pkl')
print(df_business_data.shape)
print('Time taken: {:,.2f} seconds'.format(time.time()-start_time))
df_business_data.head()
Filter down to the list of businesses that are categorized as Restaurants
- Cleanup and build list of all categories
- Filter down to rows containing Restaurants as category
- Display number of rows before and after
start_time = time.time()
# create copy so that original business_data is intact
df_business_categorized_data = df_business_data.copy()
df_business_categorized_data['categories'] = df_business_data['categories'] \
.map(lambda x : (list(map(str.strip, x.split(',')))))
print('Total data rows and columns:{}'.format(df_business_categorized_data.shape))
df_restaurants = df_business_categorized_data[df_business_categorized_data['categories'] \
.map(lambda x : 'Restaurants' in x)]
print('Restaurant data rows and columns:{}'.format(df_restaurants.shape))
print('Time taken: {:,.2f} seconds'.format(time.time()-start_time))
Since we have dropped all rows that don't have Restaurants as a category, the dataframe must be re-indexed to fill the gaps.
df_restaurants = df_restaurants.reset_index(drop=True)
df_restaurants.head()
Display the list of unique states. We will use one of these states to cluster at the state level to make sense of the clustered data
# list of states included in the dataset
df_restaurants.state.unique()
Top 20 Categories: Combine categories into a list and count the top 50. The top 20 categories that represent actual food categories will be used for analysis.
start_time = time.time()
all_categories = df_restaurants['categories'].sum()
ct = Counter(all_categories)
top_50_categories = [x[0] for x in list(ct.most_common(50))]
print('Time taken: {:,.2f} seconds'.format(time.time()-start_time))
print(top_50_categories)
Since we are calculating demand for specific categories of food, and to limit the scope of this project, we choose the top 20 specific categories of food that the businesses belong to:
- Sandwiches
- American (Traditional)
- Pizza
- Burgers
- Italian
- Mexican
- Chinese
- American (New)
- Japanese
- Chicken Wings
- Seafood
- Sushi Bars
- Canadian (New)
- Asian Fusion
- Mediterranean
- Steakhouses
- Indian
- Thai
- Vietnamese
- Middle Eastern
top_20_specific_categories = ['Sandwiches', 'American (Traditional)', 'Pizza', \
'Burgers', 'Italian', 'Mexican', 'Chinese', \
'American (New)', 'Japanese', 'Chicken Wings', \
'Seafood', 'Sushi Bars', 'Canadian (New)', \
'Asian Fusion', 'Mediterranean', 'Steakhouses', \
'Indian', 'Thai', 'Vietnamese', 'Middle Eastern']
# save top 20 categories for later sections
pd.DataFrame(top_20_specific_categories, columns=['categories']).to_pickle('df_top_20_specific_categories.pkl')
len(top_20_specific_categories)
Category Reduction: Reduce categories of each business to only include top 20 categories. All categories other than the top 20 selected above are removed for optimization since they are not useful for the purpose of this analysis.
for idx, row in df_restaurants.iterrows():
categories = row['categories']
new_categories = list(set(categories) & set(top_20_specific_categories))
df_restaurants.at[idx, 'categories'] = new_categories
# remove restaurants that don't have one of these categories
df_restaurants = df_restaurants[df_restaurants['categories'].astype(str) != '[]']
print(len(df_restaurants))
df_restaurants.head()
Getting Dummies: Create one column per category within the dataframe, with value 1 if that category applies to the business and 0 otherwise. This takes an approach similar to Get Dummies, which is often used in pandas for optimization.
df_category_flags = pd.DataFrame(0, index=np.arange(len(df_restaurants)), \
columns=top_20_specific_categories)
for index, row in df_restaurants.iterrows():
for category in row['categories']:
df_category_flags.at[index, category] = 1
restaurant_category_count = pd.DataFrame(df_category_flags.sum(), columns=['Count'])
display(restaurant_category_count)
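As an aside, the same one-hot flags can be produced without an explicit Python loop. A sketch using scikit-learn's MultiLabelBinarizer on an invented column of category lists (equivalent in spirit, not the notebook's actual approach):

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy column of category lists, mirroring df_restaurants['categories'] (values invented)
categories = pd.Series([['Pizza', 'Italian'], ['Sushi Bars'], ['Pizza']])

# fit_transform returns one binary column per distinct category, sorted alphabetically
mlb = MultiLabelBinarizer()
flags = pd.DataFrame(mlb.fit_transform(categories),
                     columns=mlb.classes_, index=categories.index)
print(flags)
```

For large frames this avoids the per-row `.at` writes of the loop above.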
Replace the category list for each restaurant with a binary flag for each category within the restaurants data
df_restaurants_flagged = df_restaurants.join(df_category_flags)
print(len(df_restaurants_flagged))
df_restaurants_flagged.head()
Save flagged restaurants to easily load for analysis later
df_restaurants_flagged.to_pickle('df_restaurants_flagged.pkl')
df_restaurants_flagged = pd.read_pickle('df_restaurants_flagged.pkl')
df_supply_indicator_by_category = df_restaurants_flagged.filter(top_20_specific_categories).sum()
df_supply_indicator_by_category.to_pickle('df_supply_indicator_by_category.pkl')
display(df_supply_indicator_by_category.to_frame('Supply (Restaurant Count)'))
Define parameters for the DBSCAN clustering algorithm
- epsilon: [ 100 meters ] We set 100 meters as the distance limit for a neighboring business to be included within a particular cluster. As long as there are businesses within 100 meters of each other, they will keep getting included in the same cluster.
- min_neighbors: [ 4 ] The least number of businesses within 100 meters of one another required to declare them a cluster. We eliminate clusters with fewer businesses than the min_neighbors threshold to reduce noise.
kms_per_radian = 6371.0088
epsilon = 0.1 / kms_per_radian  # 100 m, matching the parameter description above
min_neighbors = 4
df_restaurants_flagged = pd.read_pickle('df_restaurants_flagged.pkl')
# 18 Minimum Neighbors values x 30 Epsilon values = 540 parameter combinations
df_population_size_compare = pd.DataFrame(0, index=range(0,540), \
columns=['Minimum Neighbors','Epsilon(m)','Coverage','Count'])
start_mn = 3
end_mn = 20
start_eps = 50
end_eps = 1500
start_time = time.time()
indx = 0
for mn in range(start_mn,end_mn+1):
for e in range(start_eps, end_eps+50, 50):
eps = e/1000/kms_per_radian
dbscn = DBSCAN(eps=eps, min_samples=mn, algorithm='ball_tree', metric='haversine') \
.fit(np.radians(df_restaurants_flagged[['latitude','longitude']].values))
cluster_coverage = sum(dbscn.labels_ >= 0)  # points assigned to any cluster (label -1 is noise)
cluster_count = sum(np.unique(dbscn.labels_) >= 0)  # number of distinct clusters
df_population_size_compare.at[indx, 'Minimum Neighbors'] = mn
df_population_size_compare.at[indx, 'Epsilon(m)'] = e
df_population_size_compare.at[indx, 'Coverage'] = cluster_coverage
df_population_size_compare.at[indx, 'Count'] = cluster_count
indx = indx + 1
print("Completed mn:{} e:{} in {:,.2f} seconds".format(mn, e, time.time() - start_time))
df_population_size_compare.head()
Cluster data using the DBSCAN algorithm
df_population_size_compare.to_pickle('df_population_size_compare.pkl')
df_population_size_compare = pd.read_pickle('df_population_size_compare.pkl')
df_population_size_compare['Compression'] = 100 * (1 - df_population_size_compare['Count']/df_population_size_compare['Coverage'])
df_population_size_compare.head()
df_population_size_compare = pd.read_pickle('df_population_size_compare.pkl')
x_col,y_col,z_col = 'Minimum Neighbors','Epsilon(m)','Coverage'
x_start = 3
x_end = 20
max_x = []
max_y = []
max_z = []
max_2x = []
max_2y = []
max_2z = []
for i in range(x_start, x_end+1):
# figure out the peak values line
df = df_population_size_compare[df_population_size_compare[x_col] == (i)].reset_index(drop=True)
max_row = df[df[z_col] == df[z_col].max()]
max_x.append(max_row[x_col].values[0])
max_y.append(max_row[y_col].values[0])
max_z.append(max_row[z_col].values[0])
# find the second peak line
df = df_population_size_compare[df_population_size_compare[x_col]==i].reset_index(drop=True)
peak2df = df[df[y_col] <= 300]
max_2row = peak2df[peak2df[z_col] == peak2df[z_col].max()]
max_2x.append(max_2row[x_col].values[0])
max_2y.append(max_2row[y_col].values[0])
max_2z.append(max_2row[z_col].values[0])
x = df_population_size_compare[x_col].values
y = df_population_size_compare[y_col].values
z = df_population_size_compare[z_col].values
traces = []
traces.append(go.Scatter3d(
x=x,
y=y,
z=z,
mode='markers',
marker=dict(
size=6,
color=z,
colorscale='Jet',
opacity=0.8
),
showlegend=True,
name='Coverage'
))
# draw max line for z values
traces.append(go.Scatter3d(
z=max_z,
y=max_y,
x=max_x,
line=dict(
color='teal',
width = 4
),
mode='lines',
name='Max Counts Line'
))
# draw 2nd peak line for z values
traces.append(go.Scatter3d(
z=max_2z,
y=max_2y,
x=max_2x,
line=dict(
color='purple',
width = 4
),
mode='lines',
name='2nd Max Counts Line'
))
layout = go.Layout(
margin=dict(
l=0,
r=0,
b=50,
t=50
),
paper_bgcolor='#999999',
title='Clustered Points Coverage vs. Minimum Neighbors & Distance (meters)',
scene=dict(
camera = dict(
up=dict(x=0, y=0, z=1),
center=dict(x=0, y=0, z=-.25),
eye=dict(x=1.25, y=1.25, z=1.25)
),
xaxis=dict( title= x_col),
yaxis=dict( title= y_col),
zaxis=dict( title= z_col)
),
font= dict(color='#ffffff')
)
fig = go.Figure(data=traces, layout=layout)
display(HTML('<a id="mn_e_coverage">DBSCAN Min. Neighbors & Distance vs Coverage</a>'))
iplot(fig, filename='clusters-scatter')
From the 3D scatter heat plot above, we can observe that the cluster Coverage (total number of points clustered) is inversely proportional to the Minimum Neighbors count, maximized at mn=x=3, whereas it has 2 peaks on the Maximum Distance Epsilon(e) axis. The first peak is at 550 meters; the second peak is between values of 50 and 350 for values of Minimum Neighbors less than 6.
To further narrow down to ideal parameters, we will look at a ribbon plot of Minimum Neighbors (X-axis) and Maximum Distance Epsilon (Y-axis) against the number of clusters that resulted from clustering (Count).
The aim is to narrow down to a range where cluster count is maximized.
df_population_size_compare = pd.read_pickle('df_population_size_compare.pkl')
x_col,y_col,z_col = 'Minimum Neighbors','Epsilon(m)','Count'
x_start = 3
x_end = 20
y_start = 0
y_end = 30
traces = []
max_x = []
max_y = []
max_z = []
for i in range(x_start, x_end+1):
x = []
y = []
z = []
ci = int(255/18*i) # "color index"
df = df_population_size_compare[df_population_size_compare[x_col] == (i)].reset_index(drop=True)
max_row = df[df[z_col] == df[z_col].max()]
max_x.append(max_row[x_col].values[0])
max_y.append(max_row[y_col].values[0])
max_z.append(max_row[z_col].values[0])
max_x.append(max_row[x_col].values[0] + 0.5)
max_y.append(max_row[y_col].values[0])
max_z.append(max_row[z_col].values[0])
for j in range(y_start, y_end):
x.append([i, i+.5])
y.append([df.loc[j,y_col], df.loc[j,y_col]])
z.append([df.loc[j,z_col], df.loc[j,z_col]])
traces.append(dict(
z=z,
x=x,
y=y,
colorscale=[ [i, 'rgb(255,%d,%d)'%(ci, ci)] for i in np.arange(0,1.1,0.1) ],
showscale=False,
type='surface'
))
# draw max line for z values
traces.append(go.Scatter3d(
z=max_z,
y=max_y,
x=max_x,
line=dict(
color='green',
width = 8
),
mode='lines',
name='Max Counts Line'
))
layout = go.Layout(
autosize=True,
height=500,
margin=go.layout.Margin(
l=0,
r=0,
b=0,
t=50,
pad=0
),
paper_bgcolor='#999999',
title='Clustered Ribbons of Cluster Count vs. on Minimum Neighbors & Distance (meters)',
scene=dict(
camera = dict(
up=dict(x=0, y=0, z=1),
center=dict(x=0, y=0, z=-.25),
eye=dict(x=1.5, y=1.5, z=1.5)
),
xaxis=dict( title= x_col),
yaxis=dict( title= y_col),
zaxis=dict( title= z_col)
),
font= dict(color='#ffffff')
)
fig = { 'data':traces, 'layout': layout }
display(HTML('<a id="mn_e_count">DBSCAN Min. Neighbors & Distance vs Cluster Count</a>'))
iplot(fig, filename='ribbon-plot-python')
The ribbon chart above shows that the number of clusters is inversely proportional to the number of Minimum Neighbors. It peaks around mn = 3. For the maximum distance to include locations within a cluster (Epsilon), cluster count peaks between values of 50 and 350.
We can observe that there is a convergence from both graphs (Coverage & Count) for ranges:
Minimum Neighbors : 3 - 6
Epsilon(e) : 50 - 350 meters
We will investigate only these ranges from here onwards.
df_population_dist_compare = pd.DataFrame(None, index=range(0,28), \
columns=['Minimum Neighbors','Epsilon(m)','Min','Max', 'Labels'])
start_mn = 3
end_mn = 6
start_eps = 50
end_eps = 350
start_time = time.time()
indx = 0
for mn in range(start_mn,end_mn+1):
for e in range(start_eps, end_eps+50, 50):
eps = e/1000/kms_per_radian
dbscn = DBSCAN(eps=eps, min_samples=mn, algorithm='ball_tree', metric='haversine') \
.fit(np.radians(df_restaurants_flagged[['latitude','longitude']].values))
df = pd.DataFrame(dbscn.labels_, columns=['label'])
df_counts = df.groupby(['label']).size().reset_index(name='count')
df_counts = df_counts[(df_counts['label'] > -1) & (df_counts['count'] >= mn)]
labels = [x for x in dbscn.labels_ if x != -1] # all labels except -1
df_population_dist_compare.at[indx, 'Minimum Neighbors'] = mn
df_population_dist_compare.at[indx, 'Epsilon(m)'] = e
df_population_dist_compare.at[indx, 'Min'] = df_counts['count'].min()
df_population_dist_compare.at[indx, 'Max'] = df_counts['count'].max()
df_population_dist_compare.at[indx, 'Labels'] = labels
indx = indx + 1
print("Completed mn:{} e:{} in {:,.2f} seconds".format(mn, e, time.time() - start_time))
df_population_dist_compare.head()
df_population_dist_compare.to_pickle('df_population_dist_compare.pkl')
df_population_dist_compare = pd.read_pickle('df_population_dist_compare.pkl')
x_col,y_col,z_col = 'Minimum Neighbors','Epsilon(m)','Max'
x = df_population_dist_compare[x_col].values
y = df_population_dist_compare[y_col].values
z = df_population_dist_compare[z_col].values
zmin = df_population_dist_compare[z_col].min()
zmax = df_population_dist_compare[z_col].max()
intensity = (df_population_dist_compare[z_col].values - zmin)/(zmax-zmin)
traces = []
traces.append(
go.Mesh3d(
x = x,
y = y,
z = z,
intensity = z,
opacity=0.6,
colorscale = 'Earth',
reversescale=True
)
)
layout = go.Layout(
title='Largest Cluster vs. Min Neighbors and Epsilon',
paper_bgcolor='#999999',
scene = dict(
camera = dict(
up=dict(x=0, y=0, z=1),
center=dict(x=0, y=0, z=-.25),
eye=dict(x=-2, y=-.8, z=0.3)
),
xaxis=dict( title= x_col),
yaxis=dict( title= y_col),
zaxis=dict( title= z_col)
),
font= dict(color='#ffffff')
)
fig = go.Figure(data=traces, layout=layout)
display(HTML('<a id="mn_e_largest_cluster">DBSCAN Min. Neighbors & Distance vs Largest Cluster Size</a>'))
iplot(fig, filename='max-3d-mesh')
df_population_dist_compare = pd.read_pickle('df_population_dist_compare.pkl')
numCols = 4
fig = tools.make_subplots(rows=7, cols=4)
idx = 0
for index, row in df_population_dist_compare.iterrows():
trace = go.Histogram(
x = row['Labels'],
name = "mn:{}<br>e:{}" \
.format(row['Minimum Neighbors'], row['Epsilon(m)'])
)
i,j = idx // numCols + 1, idx % numCols + 1
fig.append_trace(trace, i, j)
idx = idx + 1
fig['layout']['xaxis' + str(idx)]['tickformat'] = 's'
fig['layout']['yaxis' + str(idx)]['tickformat'] = 's'
fig['layout']['paper_bgcolor'] = '#999999'
fig['layout']['font']['color'] = '#ffffff'
fig['layout']['font']['size'] = 9
fig['layout']['xaxis']['tickformat'] = 's'
fig['layout']['yaxis' + str(idx)]['tickformat'] = 's'
display(HTML('<a id="mn_e_histograms">DBSCAN Label Count Histogram for Min. Neighbors & Distance</a>'))
iplot(fig, filename='binning function')
Based on the histograms drawn above, the teal histogram with an Epsilon(e) distance of 100 meters and a Minimum Neighbors value of 4 would be our parameters of choice, for the following reasons:
- In the Cluster Count ribbon graph, it is on the maximum curve. It provides the highest number of clusters for a minimum of 4 neighbors
- In the Coverage scatter graph, it is well above the mn=5, e=100 combination and the rest of the values, falling below only the outliers (which would potentially include noise)
- It is in the lower (earth) range of the surface graph, which indicates that the maximum count of businesses in a cluster will be minimized.
- Its histogram is the least skewed among the mn=4 values, which means its clusters are more evenly distributed than at higher values.
- We will not be selecting mn=3 even though it has the most evenly distributed histograms, because it would not maximize the number of clusters.
Define parameters for the DBSCAN clustering algorithm
- epsilon: [ 100 meters ] We set 100 meters as the distance limit for a neighboring business to be included within a particular cluster. As long as there are businesses within 100 meters of each other, they will keep getting included in the same cluster.
- min_neighbors: [ 4 ] The least number of businesses within 100 meters of one another required to declare them a cluster. We eliminate clusters with fewer businesses than the min_neighbors threshold to reduce noise.
epsilon = 0.1 / kms_per_radian
min_neighbors = 4
start_time = time.time()
df_restaurants_flagged = pd.read_pickle('df_restaurants_flagged.pkl')
dbscn = DBSCAN(eps=epsilon, min_samples=min_neighbors, algorithm='ball_tree', metric='haversine') \
.fit(np.radians(df_restaurants_flagged[['latitude','longitude']].values))
cluster_labels = dbscn.labels_
print(dbscn)
# exclude the noise label (-1) from the cluster count
num_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
message = ' Total points clustered: {:,} \n Number of clusters: {:,} \n Compression ratio: {:.1f}% \n Time taken: {:,.2f} seconds'
print(message.format(len(df_restaurants_flagged), num_clusters, \
100*(1 - float(num_clusters) / len(df_restaurants_flagged)), time.time()-start_time))
fd_cluster_labels = pd.DataFrame(cluster_labels, columns=['label'])
print('Number of labels:{}'.format(len(cluster_labels)))
fd_cluster_labels.to_pickle('fd_cluster_labels.pkl')
fd_cluster_labels.head()
# Join cluster labels with the original dataset of the restaurants
df_restaurants_labeled = df_restaurants_flagged.join(pd.DataFrame(fd_cluster_labels))
# Filter out clusters that do not qualify requirements of minimum neighbors
df_rst_lbl_grouped = df_restaurants_labeled.groupby(['label']).size().reset_index(name='count')
df_lbl_counts = df_rst_lbl_grouped[(df_rst_lbl_grouped['label'] > -1) \
& (df_rst_lbl_grouped['count'] >= min_neighbors)].set_index('label')
# Remove all restaurants that were not labeled
df_restaurants_label_filtered = df_restaurants_labeled.join(df_lbl_counts, on='label', how='inner')
df_restaurants_labeled.to_pickle('df_restaurants_labeled.pkl')
print(len(df_restaurants_label_filtered))
df_restaurants_label_filtered.to_pickle('df_restaurants_label_filtered.pkl')
df_restaurants_label_filtered.head()
df_reviews_and_restaurants = df_review_data.join(df_restaurants_label_filtered.set_index('business_id'), \
on='business_id', how='inner', lsuffix='Review ')
print(len(df_reviews_and_restaurants))
# import data
df_bus_reviews = df_reviews_and_restaurants.set_index('business_id')
df_review_data = pd.read_pickle('df_review_data.pkl')
df_restaurants_label_filtered = pd.read_pickle('df_restaurants_label_filtered.pkl')
top_20_specific_categories = pd.read_pickle('df_top_20_specific_categories.pkl')['categories'].values
Group each user's reviews by restaurant category. The higher the count of reviews for a certain category, the more likely the user is to visit that category of restaurant.
df_user_categories_only = df_reviews_and_restaurants[np.append(top_20_specific_categories, "user_id")]
df_user_rst_visits = df_user_categories_only.groupby(['user_id']).sum()
df_user_rst_visits.to_pickle('df_user_rst_visits.pkl')
df_user_rst_visits.head()
Restaurant/Review Count Ratio: The more users review restaurants of a particular category, the more interested they are in eating that particular kind of food. Thus the overall review count of a restaurant category indicates the interest of users in that category of food and restaurants.
Over the entire population, an equilibrium should exist between the review count indicating desire for a particular restaurant's food type (let's call it the Demand Indicator) and the number of restaurants of that category reviewed that cater to that demand (the Supply Indicator).
We can calculate the ratio of the number of restaurants to the number of reviews for each category, to find the ratio by which user interest translates into restaurant count of that category in the overall population.
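As a toy illustration with invented counts (none of these numbers come from the dataset), the ratio simply converts raw review counts into "equivalent restaurants":

```python
import pandas as pd

# Hypothetical global totals: restaurant count (supply) and review count (demand)
supply = pd.Series({'Pizza': 200, 'Sushi Bars': 50})
demand = pd.Series({'Pizza': 40000, 'Sushi Bars': 20000})

ratio = supply / demand                  # restaurants per review, per category
print(ratio['Pizza'])                    # 0.005

# If users of one locality wrote 2,000 pizza reviews, that interest
# translates into a demand for roughly 10 pizza restaurants
local_reviews = 2000
local_restaurant_demand = local_reviews * ratio['Pizza']
print(local_restaurant_demand)
```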
df_demand_indicator_by_category = df_user_rst_visits.sum()
df_demand_indicator_by_category.to_frame('Demand (Review Count)')
review_restaurant_ratio = df_supply_indicator_by_category/df_demand_indicator_by_category
df_restaurant_review_ratio = review_restaurant_ratio.to_frame('Supply/Demand (Restaurant/Review) Ratio')
df_restaurant_review_ratio
Save supply/demand ratio indicator for each category of the restaurant
gb = df_restaurants_label_filtered.groupby(['label'])
__bar = __progressbar(len(gb))
df_clust_group_info = pd.DataFrame({'size': gb.size()})
df_bus_reviews = df_reviews_and_restaurants.set_index('business_id')
df_restaurant_review_ratio_tps = df_restaurant_review_ratio.transpose()
start_time = time.time()
def get_group_info(cur_cluster):
groupSize = len(cur_cluster)
df_clust_group_info.at[cur_cluster.name, 'size'] = groupSize
df_clust_group_info.at[cur_cluster.name, 'latitude'] = cur_cluster['latitude'].sum()/groupSize
df_clust_group_info.at[cur_cluster.name, 'longitude'] = cur_cluster['longitude'].sum()/groupSize
df_clust_group_info.at[cur_cluster.name, 'city'] = pd.Series(cur_cluster['city'].unique()).str.cat(sep=', ')
df_clust_group_info.at[cur_cluster.name, 'zip'] = pd.Series(cur_cluster['postal_code'].unique()).str.cat(sep=', ')
df_clust_group_info.at[cur_cluster.name, 'neighborhood'] = pd.Series(cur_cluster['neighborhood'].unique()).str.cat(sep=', ')
df_cur_cluster_reviews = cur_cluster[['business_id']].join(df_bus_reviews, on='business_id', how='inner')
df_cur_cluster_unique_users = df_cur_cluster_reviews[['user_id']].drop_duplicates()
df_clust_user_rst_visits = df_cur_cluster_unique_users.join(df_user_rst_visits, on='user_id')
df_clust_group_info.at[cur_cluster.name, 'reviews_count'] = len(df_cur_cluster_reviews)
df_clust_group_info.at[cur_cluster.name, 'user_count'] = len(df_cur_cluster_unique_users)
df_clust_group_info.at[cur_cluster.name, 'total_stars'] = cur_cluster['stars'].sum()
df_clust_group_info.at[cur_cluster.name, 'total_open'] = cur_cluster['is_open'].sum()
for category in top_20_specific_categories:
df_clust_group_info.at[cur_cluster.name, category + ' Supply'] = cur_cluster[category].sum()
df_clust_group_info.at[cur_cluster.name, category + ' Demand'] = df_clust_user_rst_visits[category].sum() \
* df_restaurant_review_ratio_tps.loc['Supply/Demand (Restaurant/Review) Ratio',category]
__bar.value += 1
gb.apply(get_group_info)
df_clust_group_info.head().transpose()
df_clust_group_info.to_pickle('df_clust_group_info.pkl')
df_clust_group_info = pd.read_pickle('df_clust_group_info.pkl')
df_clust_group_info.transpose()
For denser cluster populations, an interval scale with 20 bins is used to indicate the size of each cluster, and the same number of colors is used to make each cluster easy to recognize on the map.
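A minimal sketch of that binning on hypothetical cluster sizes; `pd.cut` splits the size range into 20 equal-width intervals whose integer codes can index a palette of 20 colors:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
cluster_sizes = pd.Series(rng.randint(4, 500, size=100))   # hypothetical sizes

# 20 equal-width bins; .cat.codes gives an integer 0-19 per cluster,
# usable as an index into a 20-color palette
bins = pd.cut(cluster_sizes, bins=20)
color_index = bins.cat.codes
print(color_index.min(), color_index.max())
```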
We will try the following classification models to evaluate and find the best algorithm for our classification:
models = ['LR','LDA','KNN','CART','GNB','MNB','BNB','LSVM','SVM','RF','BAG']
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
1. Logistic Regression - LR
2. Linear Discriminant Analysis - LDA
3. K Nearest Neighbor - KNN
4. Decision Tree Classifier - CART
5. Gaussian Naive Bayes - GNB
6. Multinomial Naive Bayes - MNB
7. Bernoulli Naive Bayes - BNB
8. Linear Support Vector Machine - LSVM
9. Kernel Support Vector Machine - SVM
10. Random Forest - RF
11. Bagging (Bootstrap Aggregating) - BAG
top_20_specific_categories = pd.read_pickle('df_top_20_specific_categories.pkl')['categories'].values
df_clust_group_info = pd.read_pickle('df_clust_group_info.pkl')
df_restaurants_label_filtered = pd.read_pickle('df_restaurants_label_filtered.pkl').reset_index()
# Adding additional columns to the data from groups, clusters ratio and
# expected demand/supply will be based on this ratio
demand_cats = [x + ' Demand' for x in top_20_specific_categories]
supply_cats = [x + ' Supply' for x in top_20_specific_categories]
local_demand_cats = [x + ' Local Demand' for x in top_20_specific_categories]
display_cats = [x + ' Display' for x in top_20_specific_categories]
# Add new Local Demand columns for each category
df_clust_group_info[local_demand_cats] = pd.DataFrame([[np.nan] * len(top_20_specific_categories)])
df_clust_group_info[display_cats] = pd.DataFrame([[np.nan] * len(top_20_specific_categories)])
scaler = MinMaxScaler(feature_range=(0, 1))
t = None
for index,row in df_clust_group_info.iterrows():
cluster_supply = row[supply_cats].transpose().sum()
cluster_demand = row[demand_cats].transpose().sum()
cluster_adjustment_ratio = (cluster_supply / cluster_demand) if cluster_demand > 0 else 0
for x in top_20_specific_categories:
localDemand = round(row[x + ' Demand'] * cluster_adjustment_ratio)
df_clust_group_info.at[index, x + ' Local Demand'] = localDemand
# apply (n - min)/(max - min) formula to the difference of Local Demand and Supply to normalize display
diff = row[supply_cats].values - df_clust_group_info.loc[index, local_demand_cats].values
scaled = scaler.fit_transform(diff.astype('float64').reshape(-1,1))
for i in range(len(diff)):
df_clust_group_info.at[index, top_20_specific_categories[i] + ' Display'] = scaled[i]
df_clust_group_info[display_cats].head()
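The adjustment in the loop above can be illustrated with toy numbers (hypothetical values, not from the dataset): per-category demand is rescaled by total supply over total demand, so the resulting "local demand" sums to the cluster's total supply while preserving the original demand proportions.

```python
# Hypothetical cluster: 10 restaurants, 1000 aggregated review "votes".
supply = {'Pizza': 6, 'Sushi': 2, 'Burgers': 2}
demand = {'Pizza': 300, 'Sushi': 500, 'Burgers': 200}

# cluster_adjustment_ratio from the loop above: total supply / total demand
ratio = sum(supply.values()) / sum(demand.values())      # 10 / 1000 = 0.01
local_demand = {cat: round(d * ratio) for cat, d in demand.items()}
print(local_demand)   # {'Pizza': 3, 'Sushi': 5, 'Burgers': 2}
# Sushi local demand (5) exceeds its supply (2): an under-served category.
```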
The data below is the merged set of cluster information with individual restaurant information. It contains restaurant categories as binary columns (20 columns, one per category) for classification. Classifying each category separately from this data will be compared with classifying all categories as one column in step 2 below.
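The binary category columns can be derived from a list-valued `categories` column roughly as follows. This is a sketch on made-up rows, not the notebook's actual preprocessing code:

```python
import pandas as pd

# Hypothetical mini-frame: one row per restaurant, list-valued categories.
df = pd.DataFrame({
    'business_id': ['b1', 'b2', 'b3'],
    'categories': [['Pizza', 'Italian'], ['Sushi'], ['Pizza']],
})

# One 0/1 indicator column per category, suitable for per-category classification.
for cat in ['Pizza', 'Italian', 'Sushi']:
    df[cat] = df['categories'].apply(lambda cats: int(cat in cats))

print(df[['business_id', 'Pizza', 'Italian', 'Sushi']])
```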
df_clust_group_info_prefixed = df_clust_group_info.add_prefix('Group ')
merged_restaurant_and_groups = pd.merge(df_clust_group_info_prefixed,df_restaurants_label_filtered, \
left_on = 'label', right_on='label')
display(merged_restaurant_and_groups.head().transpose())
merged_restaurant_and_groups.to_pickle('merged_restaurant_and_groups.pkl')
# Drop columns that cannot be added to new restaurant businesses or that will not be useful within a group
merged_restaurant_and_groups = pd.read_pickle('merged_restaurant_and_groups.pkl')
df_data_clean = merged_restaurant_and_groups.drop(['business_id','index', 'categories', \
'latitude', 'longitude', 'Group city','Group zip', 'Group neighborhood'], axis=1)
df_data_clean.to_pickle('df_data_clean.pkl')
print(df_data_clean.shape)
df_data_clean.head().transpose()
import warnings; warnings.simplefilter('ignore')
df_data_clean = pd.read_pickle('df_data_clean.pkl')
# Eliminate categories indicators from the dataset since that's what we are trying to predict
X = df_data_clean[df_data_clean.columns.difference(top_20_specific_categories)]
# Transform strings into equivalent label values
for column in X.columns:
if X[column].dtype == object:
le = LabelEncoder()
X[column] = le.fit_transform(X[column])
df_cross_val_results = pd.DataFrame(None, columns=['category', 'score'])
__bar14 = __progressbar(20 * 11)
for cat in top_20_specific_categories:
# Iterate through each category to predict it
y = df_data_clean[cat]
# Use a MinMaxScaler to scale values between 0 and 1
# It is needed by some algorithms such as MultinomialNB
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print('--{}--'.format(cat))
result = run_classifiers(X = X_scaled, y = y, num_splits = 10, rnd_state = 1, __bar = __bar14)
__bar14.value += 11
df_cross_val_results = df_cross_val_results.append({'category': cat, 'score': result}, ignore_index=True)
df_cross_val_results.to_pickle('df_cross_val_results.pkl')
df_cross_val_results = pd.read_pickle('df_cross_val_results.pkl')
models = ['LR','LDA','KNN','CART','GNB','MNB','BNB','LSVM','SVM','RF','BAG']
df_model_aggr = pd.DataFrame(columns=top_20_specific_categories)
for i, row in df_cross_val_results.iterrows():
for m in range(0,len(models)):
df_model_aggr.at[models[m], row['category']] = row['score'][m].mean()
df_model_aggr.head()
df_model_aggr.transpose().mean().sort_values(ascending=False)
df_model_aggr.to_pickle('df_model_aggr.pkl')
df_model_aggr = pd.read_pickle('df_model_aggr.pkl')
df_draw = df_model_aggr
iplot([{
'x': df_draw.index,
'y': df_draw[col],
'name': col
} for col in df_draw.columns], filename='cufflinks/simple-line2')
df_draw = df_model_aggr.transpose()
iplot([{
'x': df_draw.index,
'y': df_draw[col],
'name': col
} for col in df_draw.columns], filename='cufflinks/simple-line')
Recursive Feature Elimination with Cross Validation: we pick the Random Forest classifier for recursive feature elimination. However, over 85% of the features contribute at least somewhat to the results, so we leave the feature set unchanged for the individual binary per-category classification.
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
class RandomForestClassifierWithCoef(RandomForestClassifier):
def fit(self, *args, **kwargs):
super(RandomForestClassifierWithCoef, self).fit(*args, **kwargs)
self.coef_ = self.feature_importances_
nb=RandomForestClassifierWithCoef()
rfecv = RFECV(estimator=nb, step=1, cv=StratifiedKFold(10),
scoring='accuracy')
rfecv.fit(X, y)
print(type(rfecv.grid_scores_))
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (nb of correct classifications)")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()
The above image shows that the Support Vector Machine, Linear Support Vector Machine, Multinomial Naive Bayes and Logistic Regression performed best. However, the confusion matrices for the top 3 models below show that this performance for binary category prediction (one category column at a time) is misleading: even when nearly every sample is predicted as the majority class, a cross-validation accuracy of 80-98% is achieved, yet this does not help us categorize restaurants. Therefore, we will have to rely on merged category classification (all categories in a single column).
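The accuracy trap described above is easy to reproduce. On an imbalanced binary label, a model that predicts the majority class for every sample already scores high accuracy while identifying no positives at all (a self-contained toy example, not the notebook's data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 90% negatives, 10% positives -- similar to a single rare-category column.
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros_like(y_true)         # always predict "not this category"

print(accuracy_score(y_true, y_pred))  # 0.9 -- looks strong
print(recall_score(y_true, y_pred))    # 0.0 -- finds no restaurants of the category
```

This is why per-category accuracy alone cannot be trusted here, and why the confusion matrices below matter.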
import warnings; warnings.simplefilter('ignore')
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.metrics import classification_report,confusion_matrix
# Exclude the merged category column and the binary category columns from the X data
X = df_data_clean[df_data_clean.columns.difference(np.append(top_20_specific_categories,'category'))]
# Transform strings into equivalent label values
for column in X.columns:
if X[column].dtype == object:
le = LabelEncoder()
X[column] = le.fit_transform(X[column])
models = []
models.append(('LR', LogisticRegression()))
models.append(('MNB', MultinomialNB()))
models.append(('BNB', BernoulliNB()))
# For demonstration we only show confusion matrices for 2 categories, since the binary approach did not work in any case
for name, model in models:
for cat in np.take(top_20_specific_categories,[1,6]):
y = df_data_clean[cat]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=120, test_size = 0.3)
start_time = time.time()
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model.fit(X_train_scaled, y_train)
print('Accuracy of {} classifier on training set: {:.2f}'.format(name, model.score(X_train_scaled, y_train)))
print('Accuracy of {} classifier on test set: {:.2f}'.format(name, model.score(X_test_scaled, y_test)))
y_pred = model.predict(X_test_scaled)
print(classification_report(y_test,y_pred))
display(confusion_matrix(y_test, y_pred))
draw_confusion_matrix(cat, y_test, y_pred)
In the merged_restaurant_and_groups dataframe, the categories column contains arrays of categories, since a single restaurant may have more than one category. In this section, we flatten the categories by copying restaurant rows so that they can be used for classification and compared with the binary-column classification of categories.
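This flattening can also be expressed with `DataFrame.explode` (available since pandas 0.25). A sketch on made-up rows, equivalent in spirit to the loop below:

```python
import pandas as pd

df = pd.DataFrame({
    'business_id': ['b1', 'b2'],
    'categories': [['Pizza', 'Italian'], ['Sushi']],
})

# One row per (restaurant, category) pair, as needed for single-column classification.
flat = (df.explode('categories')
          .rename(columns={'categories': 'category'})
          .reset_index(drop=True))
print(flat)
#   business_id category
# 0          b1    Pizza
# 1          b1  Italian
# 2          b2    Sushi
```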
# separate categories into individual rows for classification
merged_restaurant_and_groups = pd.read_pickle('merged_restaurant_and_groups.pkl')
ids = []
catl = []
for i,row in merged_restaurant_and_groups[['business_id', 'categories']].iterrows():
for n in row['categories']:
ids.append(row['business_id'])
catl.append(n)
df_flat = pd.DataFrame({'business_id': ids, 'category': catl})
merged_restaurant_and_groups_flat = pd.merge(df_flat,merged_restaurant_and_groups, \
left_on = 'business_id', right_on='business_id')
merged_restaurant_and_groups_flat.to_pickle('merged_restaurant_and_groups_flat.pkl')
# Drop columns that cannot be added to new restaurant businesses or that will not be useful within a group
merged_restaurant_and_groups_flat = pd.read_pickle('merged_restaurant_and_groups_flat.pkl')
df_data_clean2 = merged_restaurant_and_groups_flat.drop(['business_id','index', 'categories', \
'latitude', 'longitude', 'Group city','Group zip', 'Group neighborhood'], axis=1)
df_data_clean2.to_pickle('df_data_clean2.pkl')
print(df_data_clean2.shape)
df_data_clean2.head().transpose()
# run cross validation for combined categories
X = df_data_clean2[df_data_clean2.columns.difference(np.append(top_20_specific_categories,'category'))]
y = df_data_clean2['category']
# Transform strings into equivalent label values
for column in X.columns:
if X[column].dtype == object:
le = LabelEncoder()
X[column] = le.fit_transform(X[column])
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
result = run_classifiers(X = X_scaled, y = y, num_splits = 10, rnd_state = 1, __bar = __bar14)
pd.DataFrame({'results': [result]}, columns=['results']).to_pickle('df_cross_val_resutls.pkl')
cross_val_results = pd.read_pickle('df_cross_val_resutls.pkl')['results'][0]
models = ['LR','LDA','KNN','CART','GNB','MNB','BNB','LSVM','SVM','RF','BAG']
mean_cross_val = []
for x in cross_val_results:
mean_cross_val.append(np.mean(x))
mean_cross_val
iplot([{
'x': models,
'y': mean_cross_val,
'name': "Cross Validation Mean"
}], filename='cufflinks/simple-line3')
The above graph shows that Logistic Regression performs slightly better for our analysis than any other classifier, so we will move forward with it for further analysis.
X = df_data_clean2[df_data_clean2.columns.difference(np.append(top_20_specific_categories,'category'))]
# Transform strings into equivalent label values
for column in X.columns:
if X[column].dtype == object:
le = LabelEncoder()
X[column] = le.fit_transform(X[column])
models = []
models.append(('LR', LogisticRegression()))
y = df_data_clean2['category']
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
rfecv = RFECV(estimator=lr, step=1, cv=StratifiedKFold(10),
scoring='accuracy')
rfecv.fit(X, y)
print(type(rfecv.grid_scores_))
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (nb of correct classifications)")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()
# Take the top 30 attributes from the RFECV ranking and run component analysis on them
df_labeled_rankings = pd.DataFrame({'cols': df_data_clean2.columns \
.difference(np.append(top_20_specific_categories,'category')), \
'ranks': rfecv.ranking_}).sort_values(['ranks']).reset_index(drop=True)
df_labeled_rankings.head(30)
import warnings; warnings.simplefilter('ignore')
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.preprocessing import MinMaxScaler
df_data_clean2 = pd.read_pickle('df_data_clean2.pkl')
# Eliminate category information from the dataset and keep just the
# top 30 columns ranked by the RFECV above
X = df_data_clean2[df_labeled_rankings['cols'].head(30).values]
#Transform strings into equivalent label values
for column in X.columns:
if X[column].dtype == object:
le = LabelEncoder()
X[column] = le.fit_transform(X[column])
model = LogisticRegression()
name = 'LR'
y = df_data_clean2['category']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=120, test_size = 0.3)
start_time = time.time()
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model.fit(X_train_scaled, y_train)
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'.format(model.score(X_train_scaled, y_train)))
print('Accuracy of Logistic Regression classifier on test set: {:.2f}'.format(model.score(X_test_scaled, y_test)))
y_pred = model.predict(X_test_scaled)
print(classification_report(y_test,y_pred))
display(confusion_matrix(y_test, y_pred))
draw_confusion_matrix_all(top_20_specific_categories, y_test, y_pred)
This convenience function runs the 11 most popular classifiers and returns their cross-validation results on the dataset.
import warnings; warnings.simplefilter('ignore')
import seaborn as sns
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
def run_classifiers(X, y, num_splits, rnd_state, __bar):
seed = rnd_state  # use the caller-supplied random state
# prepare models
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('GNB', GaussianNB()))
models.append(('MNB', MultinomialNB()))
models.append(('BNB', BernoulliNB()))
models.append(('LSVM', LinearSVC()))
models.append(('SVM', SVC()))
models.append(('RF', RandomForestClassifier()))
models.append(('BAG', BaggingClassifier()))
# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
kfold = model_selection.KFold(n_splits=num_splits, random_state=seed)
cv_results = model_selection.cross_val_score(model, X, y, cv=kfold, scoring=scoring)
results.append(cv_results)
names.append(name)
msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
print(msg)
__bar.value += 1
return results
from pylab import rcParams
def draw_confusion_matrix(category, y_test, y_pred):
rcParams['figure.figsize'] = 20, 20
faceLabels = ['Not {} (0)'.format(category),'{} (1)'.format(category)]
mat = confusion_matrix(y_test, y_pred)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
xticklabels = faceLabels, cmap="BuPu", linecolor='black', linewidths=1,
yticklabels = faceLabels)
plt.xlabel('Actual')
plt.ylabel('Predicted');
plt.show()
def draw_confusion_matrix_all(categories, y_test, y_pred):
rcParams['figure.figsize'] = 20, 20
faceLabels = categories
mat = confusion_matrix(y_test, y_pred)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
xticklabels = faceLabels, cmap="BuPu", linecolor='black', linewidths=1,
yticklabels = faceLabels)
plt.xlabel('Actual')
plt.ylabel('Predicted');
plt.show()
# data imports
df_clust_group_info = pd.read_pickle('df_clust_group_info.pkl')
df_restaurants_label_filtered = pd.read_pickle('df_restaurants_label_filtered.pkl')
top_20_specific_categories = pd.read_pickle('df_top_20_specific_categories.pkl')['categories'].values
display(df_clust_group_info.head(2))
display(df_restaurants_label_filtered.head(2))
display(top_20_specific_categories)
Define the 10- and 20-interval graphing limits and their corresponding colors for the maps
mapbox_access_token = 'pk.eyJ1IjoiZjhheml6IiwiYSI6ImNqb3plOWp6MjA0bXIzcnFxczZ1bjdrbmwifQ.5qd5W4B06UUZc20Jax12OA'
#interval_10 = pd.interval_range(start=4, periods=10, freq=4, closed='both').to_tuples()
limits_10 = [(4,10),(11,20),(21,30),(31,40),(41,50),(51,70),(71,100),(101,200),(201,400),(401,2000)]
colors_10 = ['#0000FF', '#008080', '#FF0000', '#008000', '#808000', '#000080', '#C36900', \
'#FF00FF', '#800080','#00FF00']
# interval_20 = pd.interval_range(start=4, periods=20, freq=2, closed='both').to_tuples()
limits_20 = [(4,5),(6,10),(11,15),(16,20),(21,25),(26,30),(31,35),(35,40),(41,45),(45,50),(51,60), \
(61,70),(71,80),(81,100),(101,150),(151,200),(201,300),(301,400),(401,1000),(1001,2000)]
colors_20 = ['RGB(230,25,75)','RGB(60,180,75)','RGB(255,225,25)','RGB(67,99,216)','RGB(245,130,49)', \
'RGB(145,30,180)','RGB(70,240,240)','RGB(240,50,230)','RGB(188,246,12)','RGB(250,190,190)', \
'RGB(0,128,128)', 'RGB(230,190,255)','RGB(154,99,36)','RGB(255,250,200)','RGB(170,255,195)', \
'RGB(255,216,177)','RGB(0,0,117)','RGB(128,128,128)','RGB(128,0,0)','RGB(128,128,0)']
For the city-level map, 10 interval limits and 10 colors are used to indicate each cluster's size on the map
label_sizes = df_restaurants_label_filtered[['business_id','label']].groupby(['label']).agg(['count'])
label_sizes['business_id']['count'].nlargest(10)
Massage the data for display purposes only
df_clust_group_info.head()
df_clust_group_info['label'] = df_clust_group_info.index
# Truncate long neighborhood and zip strings to 50 characters for hover text
for index,row in df_clust_group_info.iterrows():
df_clust_group_info.at[index, 'neighborhood'] = (row['neighborhood'][:50] + (row['neighborhood'][:50] and '...'))
df_clust_group_info.at[index, 'zip'] = (row['zip'][:50] + (row['zip'][:50] and '...'))
clusters = []
scale = 1
for i in range(len(limits_20)):
lim = limits_20[i]
df_sub = df_clust_group_info[((df_clust_group_info['size'] >= lim[0]) \
& (df_clust_group_info['size'] <= lim[1]))]
cluster = dict(
type = 'scattergeo',
locationmode = 'USA-states',
lon = df_sub['longitude'],
lat = df_sub['latitude'],
text = 'City: ' + df_sub['city'] + \
'<br>Neighborhood(s): ' + df_sub['neighborhood'] + \
'<br> Zip/Postal Code(s):' + df_sub['zip'],
sizemode = 'diameter',
marker = dict(
size = [i*scale]*len(df_sub),
color = colors_20[i],
line = dict(width = 2,color = 'black')
),
name = '{0} - {1}'.format(lim[0],lim[1]) )
clusters.append(cluster)
layout = dict(
title = 'Yelp Reviewed Restaurants in North America',
showlegend = True,
geo = dict(
scope='north america',
projection=dict( type='albers usa canada' ),
resolution= 50,
lonaxis= {
'range': [-150, -55]
},
lataxis= {
'range': [30, 50]
},
center=dict(
lat=43.6543,
lon=-79.3860
),
showland = True,
landcolor = 'rgb(217, 217, 217)',
subunitwidth=1,
countrywidth=1,
subunitcolor="rgb(255, 255, 100)",
countrycolor="rgba(255, 200, 255)"
),
)
fig = dict( data=clusters, layout=layout )
display(HTML('<a id="north_america_clustered">North America Clustered Restaurants by Location (All Categories)</a>'))
iplot( fig, validate=False, filename='d3-bubble-map-populations' )
df_clust_group_info = pd.read_pickle('df_clust_group_info.pkl')
clusters = []
scale = 3
for i in range(len(limits_10)):
lim = limits_10[i]
df_sub = df_clust_group_info[((df_clust_group_info['size'] >= lim[0]) \
& (df_clust_group_info['size'] <= lim[1]))]
cluster = dict(
type = 'scattergeo',
locationmode = 'USA-states',
lon = df_sub['longitude'],
lat = df_sub['latitude'],
text = 'City: ' + df_sub['city'] + \
'<br>Size: ' + df_sub['size'].astype(str) + \
'<br>Neighborhood: ' + df_sub['neighborhood'] + \
'<br>Postal Code:' + df_sub['zip'],
sizemode = 'diameter',
marker = dict(
size = [i*scale]*len(df_sub),
color = colors_10[i],
line = dict(width = 2,color = 'black')
),
name = '{0} - {1}'.format(lim[0],lim[1]) )
clusters.append(cluster)
layout = dict(
title = 'Yelp Reviewed Clustered Restaurants in Toronto',
showlegend = True,
geo = dict(
scope='north america',
projection=dict( type='albers usa canada', scale=500 ),
resolution= 50,
lonaxis= {
'range': [-130, -55]
},
lataxis= {
'range': [30, 50]
},
center=dict(
lat=43.6543,
lon=-79.3860
),
showland = True,
landcolor = 'rgb(217, 217, 217)',
subunitwidth=1,
countrywidth=1,
subunitcolor="rgb(120, 120, 120)",
countrycolor="rgb(255, 255, 255)"
),
)
fig = dict( data=clusters, layout=layout )
display(HTML('<a id="toronto_clustered">All Clustered Restaurants on Sketch (Toronto)</a>'))
iplot( fig, validate=False, filename='d3-bubble-map-populations' )
clusters = []
scale = 4
for i in range(len(limits_10)):
lim = limits_10[i]
df_sub = df_clust_group_info[((df_clust_group_info['size'] >= lim[0]) \
& (df_clust_group_info['size'] <= lim[1]))]
cluster = go.Scattermapbox(
lon = df_sub['longitude'],
lat = df_sub['latitude'],
text = 'Cluster #: ' + df_sub.index.astype(str) + \
'<br>Size: ' + df_sub['size'].astype(str) + \
'<br>City: ' + df_sub['city'] + \
'<br>Neighborhood: ' + df_sub['neighborhood'] + \
'<br>Postal Code:' + df_sub['zip'],
mode = 'markers',
marker = dict(
size = [i*scale]*len(df_sub),
color = colors_10[i]
),
name = '[{0} - {1}]'.format(lim[0],lim[1]) )
border = go.Scattermapbox(
lon = df_sub['longitude'],
lat = df_sub['latitude'],
mode='markers',
marker=dict(
size=[i * scale + 1]*len(df_sub),
color='black',
opacity=0.4
),
hoverinfo='none',
showlegend=False)
clusters.append(border)
clusters.append(cluster)
layout = go.Layout(
title = 'Yelp Reviewed Clustered Restaurants on Toronto Map',
autosize=True,
hovermode='closest',
mapbox=dict(
accesstoken=mapbox_access_token,
bearing=0,
center=dict(
lat=43.6543,
lon=-79.3860
),
pitch=0,
zoom=12
),
)
fig = dict(data=clusters, layout=layout)
display(HTML('<a id="toronto_clustered_map">All Clustered Restaurants on Map (Toronto)</a>'))
iplot(fig, filename='Multiple Mapbox')
demand_cats = [x + ' Demand' for x in top_20_specific_categories]
supply_cats = [x + ' Supply' for x in top_20_specific_categories]
local_demand_cats = [x + ' Local Demand' for x in top_20_specific_categories]
display_cats = [x + ' Display' for x in top_20_specific_categories]
# Add new Local Demand columns for each category
df_clust_group_info[local_demand_cats] = pd.DataFrame([[np.nan] * len(top_20_specific_categories)])
df_clust_group_info[display_cats] = pd.DataFrame([[np.nan] * len(top_20_specific_categories)])
scaler = MinMaxScaler(feature_range=(0, 1))
t = None
for index,row in df_clust_group_info.iterrows():
cluster_supply = row[supply_cats].transpose().sum()
cluster_demand = row[demand_cats].transpose().sum()
cluster_adjustment_ratio = (cluster_supply / cluster_demand) if cluster_demand > 0 else 0
for x in top_20_specific_categories:
localDemand = round(row[x + ' Demand'] * cluster_adjustment_ratio)
df_clust_group_info.at[index, x + ' Local Demand'] = localDemand
# apply (n - min)/(max - min) formula to the difference of Local Demand and Supply to normalize display
diff = row[supply_cats].values - df_clust_group_info.loc[index, local_demand_cats].values
scaled = scaler.fit_transform(diff.astype('float64').reshape(-1,1))
for i in range(len(diff)):
df_clust_group_info.at[index, top_20_specific_categories[i] + ' Display'] = scaled[i]
df_clust_group_info[display_cats].head()
clusters = []
scale = 4
colors = ['maroon', 'purple', 'navy', 'teal', 'olive']
for x in range(0, len(top_20_specific_categories)):
cat = top_20_specific_categories[x]
for i in range(len(limits_10)):
lim = limits_10[i]
df_sub = df_clust_group_info[((df_clust_group_info['size'] >= lim[0]) \
& (df_clust_group_info['size'] <= lim[1]))]
demandStr, supplyStr = '{} Demand'.format(cat), '{} Supply'.format(cat)
local_sd_ratio = df_sub[demandStr].max() / df_sub[supplyStr].max()
cluster = go.Scattermapbox(
lon = df_sub['longitude'],
lat = df_sub['latitude'],
text = 'Category: {}'.format(cat) + \
'<br>Size: ' + df_sub['size'].astype(str) + \
'<br>City: ' + df_sub['city'] + \
'<br>Demand: ' + df_sub['{} Local Demand'.format(cat)].astype(str) + \
'<br>Supply: ' + df_sub['{} Supply'.format(cat)].astype(str) + \
'<br>Neighborhood: ' + df_sub['neighborhood'] + \
'<br>Postal Code:' + df_sub['zip'],
mode = 'markers',
marker = dict(
size = [i*scale]*len(df_sub),
color = colors[x % 5],
opacity = df_sub['{} Display'.format(cat)]
),
name = '[{0} - {1}]'.format(lim[0],lim[1]) ,
visible= (False if x > 0 else True)
)
clusters.append(cluster)
# add border for all clusters
for i in range(len(limits_10)):
lim = limits_10[i]
df_sub = df_clust_group_info[((df_clust_group_info['size'] >= lim[0]) \
& (df_clust_group_info['size'] <= lim[1]))]
border = go.Scattermapbox(
lon = df_sub['longitude'],
lat = df_sub['latitude'],
mode='markers',
marker=dict(
size=[i * scale + 1]*len(df_sub),
color='black',
opacity=0.1
),
hoverinfo='none',
visible=True,
showlegend=False)
clusters.append(border)
steps = []
trc_count = 10  # number of traces toggled per slider step (one per size interval)
traces_per_category = 10
category_size = len(top_20_specific_categories)
# Visibility mask: hide every per-category trace, keep the trailing border traces visible
v = [False] * traces_per_category * category_size + [True] * traces_per_category
for i in range(0, category_size):
step = dict(method='restyle',
args=['visible', v[0:i * trc_count] + [True] * trc_count + v[ (i+1) * trc_count: len(v)]],
label='{}'.format(top_20_specific_categories[i]))
steps.append(step)
sliders = [dict(active=0,
pad={"t": 1},
steps=steps)]
layout = go.Layout(
title = 'Yelp Reviewed Restaurants Supply/Demand by Category Slider',
autosize=True,
hovermode='closest',
mapbox=dict(
accesstoken=mapbox_access_token,
bearing=0,
center=dict(
lat=43.6543,
lon=-79.3860
),
pitch=0,
zoom=12
),
sliders = sliders
)
fig = dict(data=clusters, layout=layout)
display(HTML('<a id="toronto_clustered_categorized">Slider Controlled Categories Displaying Demand (Toronto)</a>'))
iplot(fig, filename='Multiple Mapbox')